Load the data and libraries

library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.1.3
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.1.3
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyr)
## Warning: package 'tidyr' was built under R version 3.1.3
data("diamonds")

Your first task is to create a scatterplot of price vs x, y, or z using the ggplot syntax.

ggplot(aes(x = x, y = price), data = diamonds) + geom_point(alpha = 1/20) + 
  xlim(3.5,9.5) + 
  labs(title = "Diamond Price vs. Length", x = "Length of diamond (mm)",
  y = "Price of diamond (USD)")
## Warning: Removed 20 rows containing missing values (geom_point).

ggplot(aes(x = y, y = price), data = diamonds) + geom_point(alpha = 1/20, color = "blue") + 
  xlim(3.5,10) + 
  labs(title = "Diamond Price vs. Width", x = "Width of diamond (mm)",
       y = "Price of diamond (USD)")
## Warning: Removed 12 rows containing missing values (geom_point).

ggplot(aes(x = z, y = price), data = diamonds) + geom_point(alpha = 1/20, color = "red") + 
  xlim(1,7.5) +
  labs(title = "Diamond Price vs. Height", x = "Height of diamond (mm)",
       y = "Price of diamond (USD)")
## Warning: Removed 22 rows containing missing values (geom_point).

What is the correlation between price and each of X, Y, and Z?

cor(diamonds$x, diamonds$price)
## [1] 0.8844352
cor(diamonds$y, diamonds$price)
## [1] 0.8654209
cor(diamonds$z, diamonds$price)
## [1] 0.8612494

Create a simple scatter plot of price vs depth.

ggplot(aes(x = depth, y = price), data = diamonds) + geom_point() + 
  labs(title = "Diamond Price vs. Depth", x = "Depth (%)",
       y = "Price of diamond (USD)")

Change the code to make the transparency of the points to be 1/100 of what they are now and mark the x-axis every 2 units.

ggplot(aes(x = depth, y = price), data = diamonds) + geom_point(alpha = 1/100, color = "green4") + 
  scale_x_continuous(breaks = seq(0, 100, 2)) + 
  labs(title = "Diamond Price vs. Depth", x = "Depth (%)",
       y = "Price of diamond (USD)")

What is the correlation between price and depth? Would you use this as a predictor?

cor(diamonds$depth, diamonds$price)
## [1] -0.0106474

No, the correlation is almost 0

Create a scatterplot of price vs carat and omit the top 1% of price and carat values.

ggplot(aes(x = carat, y = price), data = diamonds) + geom_point(alpha = 1/50) + 
  xlim(0, quantile(diamonds$carat, 0.99)) +
  ylim(0, quantile(diamonds$price, 0.99)) +
  labs(title = "Diamond Price vs. Carat", x = "Carat",
       y = "Price of diamond (USD)")
## Warning: Removed 926 rows containing missing values (geom_point).

Create a scatterplot of price vs. volume (x * y * z).

ggplot(aes(x = x*y*z, y = price), data = diamonds) + geom_point(alpha = 1/30) +  
  xlim(0, 600) +
  labs(title = "Diamond Price vs. Volume", x = "Diamond Volume (mm^3)",
       y = "Price of diamond (USD)")
## Warning: Removed 9 rows containing missing values (geom_point).

What is the correlation between price and volume?

diamonds$diamonds_volume <- diamonds$x * diamonds$y * diamonds$z
cor(diamonds$diamonds_volume, diamonds$price)
## [1] 0.9023845

About 90%

Subset the data to exclude diamonds with a volume greater than or equal to 800. Also, exclude diamonds with a volume of 0. Adjust the transparency of the points and add a linear model to the plot.

ggplot(aes(x = diamonds_volume, y = price), data = diamonds[diamonds$diamonds_volume < 800 & diamonds$diamonds_volume > 0]) + 
  geom_point(alpha = 1/30) +  
  xlim(0, 600) +
  labs(title = "Diamond Price vs. Volume", x = "Diamond Volume (mm^3)",
       y = "Price of diamond (USD)")
## Warning: Removed 9 rows containing missing values (geom_point).

Do you think this would be a useful model to estimate the price of diamonds? Why or why not?

Yes the correlation is ~90%

Use the function dplyr package to create a new data frame containing info on diamonds by clarity. Name the data frame diamondsByClarity. The data frame should contain the following variables in this order.

   (1) mean_price
   (2) median_price
   (3) min_price
   (4) max_price
   (5) n where n is the number of diamonds in each level of clarity.

Subset the data for price and clarity

diamondsByClarity <- diamonds %>% select(price, clarity)

Summarize descriptive statistics by price or clarity

diamondsByClarity2 <- diamondsByClarity %>% group_by(clarity) %>%
                      summarise(mean_price = mean(price), median_price = median(price), 
                          min_price = min(price), max_price = max(price), n = n())

print(diamondsByClarity2)
## Source: local data frame [8 x 6]
## 
##   clarity mean_price median_price min_price max_price     n
##    (fctr)      (dbl)        (dbl)     (int)     (int) (int)
## 1      I1   3924.169         3344       345     18531   741
## 2     SI2   5063.029         4072       326     18804  9194
## 3     SI1   3996.001         2822       326     18818 13065
## 4     VS2   3924.989         2054       334     18823 12258
## 5     VS1   3839.455         2005       327     18795  8171
## 6    VVS2   3283.737         1311       336     18768  5066
## 7    VVS1   2523.115         1093       336     18777  3655
## 8      IF   2864.839         1080       369     18806  1790